Status: Draft
Feb 1, 2021
The racial and ethnic disparities in the COVID-19 pandemic have exposed longstanding health inequities in the U.S., which have been uncovered by multiple analyses of COVID-19 data from the Covid Tracking Project, New York Times, American Public Media Research Lab, Kaiser Family Foundation, National Public Radio, and Brookings Institution among many others.
Unfortunately, we still don't have a full understanding of the degree of these disparities because of the fragmented landscape of race/ethnicity data. On Jan 29, the Covid Tracking Project wrote "the continued lack of either complete federal demographic data or federal guidelines for what states should publish make it impossible to fully understand who is being infected with and dying of COVID-19" and that we are still missing race information for about a third of COVID-19 cases. The most reliable and up-to-date data is scattered across individual state public health websites that use different standards and categories for reporting race/ethnicity. Collecting this data has largely been left to non-governmental organizations like the Covid Tracking Project, which announced that they will stop collecting data on March 7, 2021, a full year after they started. Even the federal government has looked to the Covid Tracking Project for reliable data on race/ethnicity for COVID-19 data. The office of the Assistant Secretary for Planning and Evaluation, an agency within Health and Human Services, wrote in Oct 2020 that "The volunteer-based COVID tracking project has created the most comprehensive centralized resource for race and ethnicity data at the state level."
For case data in particular, more so than deaths data, the CDC has only published public race/ethnicity data at the U.S. level, not the state or county levels. In July 2020, the New York Times published The Fullest Look Yet at the Racial Inequity of Coronavirus, a one-time analysis of CDC data obtained via FOIA and legal action that contained county-level case data with race/ethnicity up to May 28, 2020. Several non-governmental organizations have taken it upon themselves to gather data for total case counts at the county level (New York Times, Johns Hopkins University, USAFacts), but none of them have collected race/ethnicity data; this would be a huge undertaking due to the non-standard reporting of race/ethnicity across state and local public health websites.
In Nov 2020, the CDC made some of the case data that the New York Times obtained public: county-level totals in a dashboard and public data about race/ethnicity with additional dashboards, but without state and county details. They also released restricted access data with race/ethnicity, state, and county available upon request. The CDC's initial restricted access data agreement did not allow for county-level analyses to be made public, but an updated data agreement from Dec 14, 2020 allowed such public analyses. In Jan 2021, the Morehouse School of Medicine's Satcher Health Leadership Institute (MSM/SHLI) in collaboration with Google.org applied for and got access to this data.
The CDC Restricted Access data enabled us to complete the first analysis of race/ethnicity disparities across the U.S. at the county level since the New York Times analysis in July. However, the data has significant completeness issues; e.g., only 55% of cases have known race/ethnicity.
The goal of this analysis is to assess the completeness of the CDC's Restricted Access data and its feasibility in examining disparities in race/ethnicity for COVID-19 cases at the county level. We will first assess the completeness of the data on its own by looking at which fields are viable for analysis. We will next compare the total case counts in the restricted access data to two reliable public datasets at the state and county levels. We will also compare cases with race/ethnicity at the state level to the Covid Tracking Project's Covid Racial Data Tracker data.
The top-level data completeness findings are:
We didn't include the following in this report:
The restricted access data contains 32 fields, which are described on the CDC website. The public version of the restricted access data contains 12 of those fields. The data comes from this case report form that is a dense, two-page form to get information about each lab-confirmed or probable COVID-19 case. The CDC has extensive FAQs about this surveillance data, one of which is about completeness:
How complete are the data that the CDC receives about COVID-19 cases?
The COVID-19 pandemic has put unprecedented demands on the public health data supply chain. In many states, the large number of COVID-19 cases has severely strained the ability of hospitals, healthcare providers, and laboratories to report cases with complete demographic information, such as race and ethnicity. The unprecedented volume of cases has also limited the ability of state and local health departments to conduct thorough case investigations and collect all requested case data.
As a result, many COVID-19 case notifications submitted to CDC do not have complete information on patient demographics; signs and symptoms of illness; underlying health conditions; characteristics of hospitalizations such as ventilator use; clinical outcomes; exposures; and factors that may put people at higher risk for severe disease. Because it can be time-consuming for jurisdictions to collect the additional information, these data can lag behind the aggregate counts. Because of missing data, analyses of these data elements are likely an underestimate of the true occurrence.
#@title
import pandas as pd
import altair as alt
from vega_datasets import data
from google.colab import auth
auth.authenticate_user()
# Turn off the three-dot menu for Altair/Vega charts.
alt.renderers.set_embed_options(actions=False)
#%load_ext google.colab.data_table
#@title
def FieldAnalysis(project_id, table, field_list):
dict = {}
for field in field_list:
dict[field] = [0.0, 0.0, 0.0, 0.0]
unknowns = pd.DataFrame(dict, index=['Unknown', 'Missing', 'NA', 'Known'])
field_series = []
value_series = []
percent_series = []
for field in field_list:
field_unknowns_query = ('''
SELECT
%s,
count(*) as cases
FROM
%s
GROUP BY
%s
''')
query = field_unknowns_query % (field, table, field)
field_unknowns_df = pd.io.gbq.read_gbq(query, project_id=project_id)
field_unknowns_df.set_index(field, inplace=True)
field_unknowns_df.index = field_unknowns_df.index.fillna('Null')
missing_count = 0
if 'Missing' in field_unknowns_df.index:
missing_count += field_unknowns_df.loc['Missing'].cases
if 'Null' in field_unknowns_df.index:
missing_count += field_unknowns_df.loc['Null'].cases
#if field_unknowns_df.index.isnull().any():
# missing_count += field_unknowns_df.loc[field_unknowns_df.index.isnull()].cases
unknowns.loc['Missing', field] = missing_count / field_unknowns_df.cases.sum()
if 'Unknown' in field_unknowns_df.index:
unknowns.loc['Unknown', field] = field_unknowns_df.loc['Unknown'].cases / field_unknowns_df.cases.sum()
if 'NA' in field_unknowns_df.index:
unknowns.loc['NA', field] = field_unknowns_df.loc['NA'].cases / field_unknowns_df.cases.sum()
unknowns.loc['Known', field] = 1 - (unknowns.loc['Missing', field] +
unknowns.loc['Unknown', field] +
unknowns.loc['NA', field])
field_series.extend([field, field, field, field])
value_series.extend(['Known', 'Supressed', 'Unknown', 'Missing'])
percent_series.extend([unknowns.loc['Known', field],
unknowns.loc['NA', field],
unknowns.loc['Unknown', field],
unknowns.loc['Missing', field]])
test = pd.DataFrame.from_dict({'field': field_series,
'value': value_series,
'percent': percent_series})
return alt.Chart(test).mark_bar().encode(
x=alt.X('percent', axis=alt.Axis(format='%'), title=''),
y=alt.Y('field', sort='x', title='Field'),
color=alt.Color('value', scale=alt.Scale(scheme='category20'), title='Value'),
order=alt.Order('field:N'),
tooltip=[
alt.Tooltip('field:N', title='Field'),
alt.Tooltip('value:N', title='Value'),
alt.Tooltip('percent:Q', format=',.0%', title='Percent'),
]
)
Based on our analysis of the CDC data up to Dec 16, 2020, the only fields that are available for more than 50% of the cases are the date that the case was first reported to the CDC, the status of the case (lab-confirmed or probable), state, county, sex, age, and race/ethnicity, which are shown in the chart below. All other fields, including whether the person died or was hospitalized, are known for fewer than 50% of the cases.
#@title
field_list = ['cdc_case_earliest_dt', 'current_status', 'res_state', 'res_county', 'sex', 'age_group', 'race_ethnicity_combined']
project_id = 'msm-secure-data-1b'
table = '`msm-secure-data-1b.ndunlap_secure.cdc_restricted_access_20201231`'
FieldAnalysis(project_id, table, field_list).display()
Race/ethnicity is known for only 55% of cases, while the other fields above are known for 97%-99% of cases. The 45% of cases without known race/ethnicity were either marked as "Unknown" on the case report form (35%), missing due to being left blank on the form (4%), or suppressed for privacy reasons for small geographic and/or demographic population groups (2%).
The CDC discussed the incompleteness of race/ethnicity data in their case data FAQs:
Most states have demographic factors like age and sex for most reported cases. However, in many states, the large number of COVID-19 cases has severely strained the ability to report cases with complete demographic information for race and ethnicity. With thousands of cases being reported, completeness of these elements is unlikely to improve in the immediate future for some jurisdictions.
Has the data gotten more complete since the New York Times obtained a copy of the case surveillance data in May, 2020? Based on the comparison table below, the data has improved in terms of more counties included and a higher percent of cases with race/ethnicity and county; however, some of those differences may be due to the fact that there are simply more counties with COVID-19 cases in the more recent data. The percent of cases included in the data has also improved as compared to totals from the Covid Tracking Project (CTP).
#@title
# Manually update these fields based on the latest CDC data.
row_names = [
'Update frequency',
'Data as of date',
'Cases in CTP as of date',
'Cases in data as of date',
'(as a % of CTP)',
'Number of counties',
'(as a % of all counties)',
'U.S. population in those counties',
'(as a % of total U.S population – States + D.C.)',
'Cases with known race/ethnicity and county',
'(as a % of cases in data)',
'Access'
]
nyt_cdc_metadata = [
'Once',
'May 28, 2020',
'1.7M',
'1.4M',
'(88%)',
'974',
'(31%)',
'178M',
'(~55%)',
'~0.6M',
'(44%)',
'Not public'
]
cdc_metadata = [
'Monthly',
'Dec 16, 2020',
'16.8M',
'13.4M',
'(80%)',
'3,052',
'(96%)',
'324M',
'(99.8%)',
'7.2M',
'(54%)',
'Restricted'
]
table_data = {'NYT/CDC': nyt_cdc_metadata, 'CDC': cdc_metadata}
metadata_df = pd.DataFrame(table_data, index=row_names)
metadata_df.head(15)
Sources: NYT article and The Daily podcast episode about the article, CTP total case counts for the U.S. by date.
The first step to evaluating the completeness of the CDC data is to check the total case counts at the U.S., state, and county levels against known accurate data sources that aggregate state and local public health websites. The CDC case data FAQs say that we should not expect case data to always match the more accurate aggregate data, but that's a tradeoff we must make to get more detailed demographic information:
Aggregate counts provide the most up-to-date validated numbers on cases and deaths.
CDC receives the line-level data primarily from state health departments without personal identifiers such as names or home addresses. Because it can be time-consuming for jurisdictions to collect the additional information, these data can lag behind the aggregate counts. Although CDC receives this information for most cases, it does not receive it for all cases.
Many public health websites do contain race/ethnicity details, but they do not all use the same standard race/ethnicity categories (CRDT analysis). So, we may need to sacrifice accuracy and timeliness to get standardized race/ethnicity data on cases across states and counties.
We will compare the CDC data against two sources of aggregate data: The Covid Racial Data Tracker (CRDT) and the NYT's public data, which are both updated on a regular basis (CRDT twice a week, NYT daily) and come from state and local public health websites. CRDT is the only source for case data with race/ethnicity breakdowns, but there are several sources for county-level aggregate case data in addition to the NYT, such as JHU and USAFacts (this paper analyzes the differences between those sources at the state level up to July for cases and deaths).
The table below compares geographic vs. race/ethnicity availability for these three different data sources:
#@title
row_names = [
'Total Cases — States',
'Total Cases — Counties',
'Cases by Race/Ethnicity — States',
'Cases by Race/Ethnicity — Counties'
]
nyt_yn = [
'✅',
'✅',
'❌',
'❌',
]
crdt_yn = [
'✅',
'❌',
'✅',
'❌',
]
cdc_yn = [
'✅',
'✅',
'✅',
'✅',
]
table_data = {'NYT': nyt_yn, 'CRDT': crdt_yn, 'CDC': cdc_yn}
availability_df = pd.DataFrame(table_data, index=row_names)
availability_df.head()
Because the CDC is the only data source that has race/ethnicity at the county level, the most similar data for purposes of comparison are (1) CRDT data at the state level with race/ethnicity, and (2) NYT data at the county level with no race/ethnicity.
We will compare across these data sources up to Dec 16, 2020, which is the latest reporting date in the CDC data. We expect to see differences in the case counts due to lags in reporting the data, but we don't expect that time lags can explain large percentages of missing cases.
To get a baseline of how much we could expect the CDC case counts to match the CRDT or NYT, we can see how closely the CRDT and NYT match each other. Each dot below is a state (hover to see details), and the black line shows where the NYT and CRDT case counts are equal.
#@title
CASES = 'Cases'
DATASET = 'cdc'
metric = CASES
project_id = 'msm-secure-data-1b'
cdc_table = '`%s.ndunlap_secure.cdc_restricted_access_20201231`' % project_id
date = 'DATE(2020, 12, 16)'
date_int = '20201216'
date_display_name = 'Dec 16'
# Chart settings.
total_cases_scale_max = 2000000
scatter_height = 300
scatter_width = 300
map_height = 350
map_width = 450
us_states = alt.topo_feature(data.us_10m.url, 'states')
us_counties = alt.topo_feature(data.us_10m.url+"#", 'counties')
territories = ('PR', 'GU', 'VI', 'MP', 'AS')
#@title
crdt_query = ('''
SELECT
State as state,
Cases_Total as crdt_cases,
Cases_Total - Cases_Unknown as crdt_known_race_cases,
ROUND(1 - Cases_Unknown / Cases_Total, 4) as crdt_known_race_cases_percent,
FROM `msm-secure-data-1b.ndunlap_secure.crdt`
WHERE
date = %s
''' % date_int)
nyt_states_query = ('''
SELECT
state_name,
state_fips_code,
confirmed_cases as nyt_cases,
deaths as nyt_deaths
FROM `bigquery-public-data.covid19_nyt.us_states`
WHERE
date = %s AND
state_fips_code IS NOT NULL
''' % date)
nyt_counties_query = ('''
SELECT
county_fips_code,
confirmed_cases as nyt_cases,
FROM `bigquery-public-data.covid19_nyt.us_counties`
WHERE
date = %s AND
county_fips_code IS NOT NULL
''' % date)
cdc_states_query = ('''
SELECT
res_state,
COUNT(*) as cdc_cases
FROM
%s
GROUP BY
res_state
''' % cdc_table)
cdc_counties_query = ('''
SELECT
res_state,
res_county,
race_ethnicity_combined,
COUNT(*) as cases
FROM
%s
GROUP BY
res_county,
res_state,
race_ethnicity_combined
''' % cdc_table)
#@title
def CreateScatterPlot(
chart_df, fields_dict, title, scale_max, height, width, geo, metric_type):
geo_field = 'state'
geo_field_display_name = 'State'
if geo == 'county':
geo_field = 'state_county'
geo_field_display_name = 'County'
if metric_type == 'ratio':
scale_scheme = 'blueorange'
scale_reverse = True
scale_domain = [0, 2]
legend_format = '.1f'
axis_format = ',.0f'
elif metric_type == 'percent':
scale_scheme = 'redyellowblue'
scale_reverse = False
scale_domain = [0, 1]
legend_format = '.0%'
axis_format = '.0%'
tooltips = [alt.Tooltip(geo_field + ':N', title=geo_field_display_name)]
for field in ('y', 'x', 'percent'):
tooltips.append(alt.Tooltip(
fields_dict[field]['name'] + ':Q',
format=fields_dict[field]['format'],
title=fields_dict[field]['title'],
))
plot = alt.Chart(chart_df).mark_circle(size=60).encode(
alt.X(fields_dict['x']['name'] + ':Q', axis=alt.Axis(title=fields_dict['x']['title'], format=axis_format),
scale=alt.Scale(domain=(0, scale_max))
),
alt.Y(fields_dict['y']['name'] + ':Q', axis=alt.Axis(title=fields_dict['y']['title'], format=axis_format),
scale=alt.Scale(domain=(0, scale_max))
),
color=alt.Color(fields_dict['percent']['name'],
type='quantitative',
scale=alt.Scale(scheme=scale_scheme,
reverse=scale_reverse,
domain=scale_domain,
clamp=True),
legend=alt.Legend(format=legend_format),
title=metric_type.capitalize()),
tooltip=tooltips,
).properties(
height=height,
width=width,
)
if metric_type == 'ratio':
plot.interactive()
line = pd.DataFrame({
'x': [0, scale_max],
'y': [0, scale_max],
})
if metric_type == 'ratio':
line_plot = alt.Chart(line).mark_line(color='black').encode(
x='x',
y='y',
)
elif metric_type == 'percent':
line_plot = (
alt.Chart(pd.DataFrame({'x': [.5]})).mark_rule().encode(y='x') +
alt.Chart(pd.DataFrame({'y': [.5]})).mark_rule().encode(x='y')
)
# Add interative for concatenating due to https://github.com/altair-viz/altair/issues/2010.
scatter = (plot + line_plot).properties(
title=title,
height=height,
width=width,
).interactive()
return scatter
def CreateMap(
chart_df, fields_dict, title, scale_max, height, width, geo, metric_type):
geo_field = 'state'
geo_field_display_name = 'State'
fips_code = 'state_fips_code'
topo_feature = us_states
if geo == 'county':
geo_field = 'state_county'
geo_field_display_name = 'County'
fips_code = 'county_fips'
topo_feature = us_counties
if metric_type == 'ratio':
scale_scheme = 'blueorange'
scale_reverse = True
scale_domain = [0, 2]
legend_format = '.1f'
elif metric_type == 'percent':
scale_scheme = 'redyellowblue'
scale_reverse = False
scale_domain = [0, 1]
legend_format = '.0%'
highlight = alt.selection_single(on='mouseover', fields=['id', fips_code], empty='none')
tooltips = [alt.Tooltip(geo_field + ':N', title=geo_field_display_name)]
for field in ('y', 'x', 'percent'):
tooltips.append(alt.Tooltip(
fields_dict[field]['name'] + ':Q',
format=fields_dict[field]['format'],
title=fields_dict[field]['title'],
))
field_names = [geo_field]
field_names.extend([fields_dict[field]['name'] for field in fields_dict])
plot = alt.Chart(topo_feature).mark_geoshape(
stroke='white',
strokeOpacity=.2,
strokeWidth=1
).project(
type='albersUsa'
).transform_lookup(
lookup='id',
from_=alt.LookupData(chart_df, fips_code, field_names)
).encode(
alt.Color(fields_dict['percent']['name'],
type='quantitative',
legend=alt.Legend(format=legend_format),
scale=alt.Scale(scheme=scale_scheme,
reverse=scale_reverse,
domain=scale_domain,
clamp=True,
),
title=metric_type.capitalize()),
tooltip=tooltips
).add_selection(
highlight,
)
states_outline = alt.Chart(us_states).mark_geoshape(stroke='white', strokeWidth=1.5, fillOpacity=0, fill='white').project(
type='albersUsa'
)
states_fill = alt.Chart(us_states).mark_geoshape(
fill='#F1F1F1',
stroke='white'
).project('albersUsa')
layered_map = alt.layer(states_fill, plot, states_outline).properties(
height=height,
width=width,
title=title,
)
return layered_map
def CreateScatterPlotAndMap(
chart_df, fields_dict, title, total_cases_scale_max, scatter_height, scatter_width, map_width, geo, metric_type):
scatter = CreateScatterPlot(
chart_df, fields_dict, title, total_cases_scale_max, scatter_height, scatter_width, geo, metric_type)
map = CreateMap(
chart_df, fields_dict, title, total_cases_scale_max, scatter_height, map_width, geo, metric_type)
return (scatter | map).configure_view(
strokeWidth=0,
).configure_mark(
stroke='grey'
).configure_legend(
gradientLength=scatter_height - 50
)
def PrintSummaryStats(chart_df, field='percent'):
below_15 = len(chart_df[chart_df[field] < .85]) / len(chart_df)
above_15 = len(chart_df[chart_df[field] > 1.15]) / len(chart_df)
print('between +/-15%: ', round(1 - below_15 - above_15, 2))
below_50 = len(chart_df[chart_df[field] < .5]) / len(chart_df)
above_50 = len(chart_df[chart_df[field] > 1.55]) / len(chart_df)
print('between +/-50%: ', round(1 - below_50 - above_50, 2))
print('< than .50: ', len(chart_df[chart_df[field] < .5]))
print('> than 1.50: ', len(chart_df[chart_df[field] > 1.5]))
print(chart_df[field].describe())
#@title
states_to_fips = {'AL': 1, 'AK': 2, 'AZ': 4, 'AR': 5, 'AS': 3, 'CA': 6, 'CO': 8, 'CT': 9, 'DC': 11, 'DE': 10, 'FL': 12, 'GA': 13, 'GU': 14, 'HI': 15, 'ID': 16, 'IL': 17, 'IN': 18, 'IA': 19, 'KS': 20, 'KY': 21, 'LA': 22, 'ME': 23, 'MD': 24, 'MA': 25, 'MI': 26, 'MN': 27, 'MS': 28, 'MO': 29, 'MT': 30, 'NE': 31, 'NV': 32, 'NH': 33, 'NJ': 34, 'NM': 35, 'NY': 36, 'NYC': 36, 'NC': 37, 'ND': 38, 'OH': 39, 'OK': 40, 'OR': 41, 'PA': 42, 'PR': 43, 'RI': 44, 'SC': 45, 'SD': 46, 'TN': 47, 'TX': 48, 'UT': 49, 'VT': 50, 'VA': 51, 'VI': 52, 'WA': 53, 'WV': 54, 'WI': 55, 'WY': 56, 'AS': 60, 'GU': 66, 'MP': 69, 'PR': 72, 'VI': 78}
crdt_df = pd.io.gbq.read_gbq(crdt_query, project_id=project_id)
crdt_df.set_index('state', inplace=True)
nyt_states_df = pd.io.gbq.read_gbq(nyt_states_query, project_id=project_id)
nyt_states_df.state_fips_code.unique()
nyt_territories = ('Puerto Rico', 'Guam', 'Virgin Islands', 'Northern Mariana Islands', 'American Samoa')
for territory in nyt_territories:
nyt_states_df = nyt_states_df[nyt_states_df.state_name != territory]
nyt_states_df['state_fips_code'] = nyt_states_df.state_fips_code.astype(int)
nyt_states_df.set_index('state_fips_code', inplace=True)
crdt_df.reset_index(inplace=True)
crdt_df['state_fips_code'] = crdt_df.state
crdt_df = crdt_df.replace(to_replace={'state_fips_code': states_to_fips})
crdt_df.set_index('state_fips_code', inplace=True)
nyt_crdt_merged_df = nyt_states_df.join(crdt_df, on="state_fips_code", how='inner', lsuffix='_left', rsuffix='_right')
nyt_crdt_merged_df['percent'] = round(nyt_crdt_merged_df.nyt_cases / nyt_crdt_merged_df.crdt_cases, 2)
nyt_crdt_merged_df
nyt_crdt_merged_df.reset_index(inplace=True)
below_15 = len(nyt_crdt_merged_df[nyt_crdt_merged_df.percent < .85]) / len(nyt_crdt_merged_df)
above_15 = len(nyt_crdt_merged_df[nyt_crdt_merged_df.percent > 1.15]) / len(nyt_crdt_merged_df)
#print('between +/-15%: ', round(1 - below_15 - above_15, 2))
#nyt_crdt_merged_df.percent.describe()
#@title
nyt_crdt_fields_dict = {
'x': {'name': 'crdt_cases', 'format': ',', 'title': 'CRDT cases'},
'y': {'name': 'nyt_cases', 'format': ',', 'title': 'NYT cases'},
'percent': {'name': 'percent', 'format': '.2f', 'title': 'Ratio of NYT to CRDT'},
}
nyt_crdt_title = 'Ratio of NYT to CRDT Cases by State as of %s' % date_display_name
CreateScatterPlotAndMap(
nyt_crdt_merged_df, nyt_crdt_fields_dict, nyt_crdt_title, total_cases_scale_max, scatter_height, scatter_width, map_width, 'state', 'ratio'
).display()
The ratio of NYT to CRDT cases is between 0.97 and 1.11 for all states:
We can see below that the CDC case counts differ from the CRDT case counts much more drastically than the NYT did.
#@title
cdc_states_df = pd.io.gbq.read_gbq(cdc_states_query, project_id=project_id)
cdc_states_df.rename(columns={'res_state': 'state'}, inplace=True)
cdc_states_df.set_index('state', inplace=True)
crdt_df = pd.io.gbq.read_gbq(crdt_query, project_id=project_id)
for territory in territories:
crdt_df = crdt_df[crdt_df.state != territory]
crdt_df.set_index('state', inplace=True)
cdc_crdt_merged_df = cdc_states_df.join(crdt_df, on="state", how='inner', lsuffix='_left', rsuffix='_right')
cdc_crdt_merged_df.reset_index(inplace=True)
cdc_crdt_merged_df['state_fips_code'] = cdc_crdt_merged_df.state
cdc_crdt_merged_df = cdc_crdt_merged_df.replace(to_replace={'state_fips_code': states_to_fips})
cdc_crdt_merged_df['percent'] = round(cdc_crdt_merged_df.cdc_cases / cdc_crdt_merged_df.crdt_cases, 4)
# PrintSummaryStats(cdc_crdt_merged_df)
#@title
cdc_crdt_fields_dict = {
'x': {'name': 'crdt_cases', 'format': ',', 'title': 'CRDT cases'},
'y': {'name': 'cdc_cases', 'format': ',', 'title': 'CDC cases'},
'percent': {'name': 'percent', 'format': '.2f', 'title': 'Ratio of CDC to CRDT'},
}
cdc_crdt_title = 'Ratio of CDC to CRDT Cases by State as of %s' % date_display_name
CreateScatterPlotAndMap(
cdc_crdt_merged_df, cdc_crdt_fields_dict, cdc_crdt_title, total_cases_scale_max, scatter_height, scatter_width, map_width, 'state', 'ratio'
).display()
The ratio of CDC to CRDT cases is between 0.03 and 1.64 for all states + D.C.:
The 32 states that were within +/-15% of the CRDT data could plausibly be off due to time lags in reporting cases to the CDC vs. reporting them on state public health websites, but there are many outlier states that are too far off from the CRDT case counts to be explained by a time lag:
#@title
# CDC vs. NYT county
df = pd.io.gbq.read_gbq(cdc_counties_query, project_id=project_id)
for territory in territories:
df = df[df.res_state != territory]
project_id = 'msm-internal-data'
df_county_fips_map = pd.io.gbq.read_gbq(f'''
SELECT
*
FROM
# `msm-internal-data.ipums_acs.acs_2019_5year_county`
`msm-internal-data.crew.county_fips_mapping`
''', project_id=project_id)
df_county_fips_map.crew_county = df_county_fips_map.crew_county.str.lower()
df_county_fips_map['state_county'] = df_county_fips_map.state + '-' + df_county_fips_map.crew_county
df_county_fips_map['state_county'] = df_county_fips_map.state_county.astype('string').str.strip()
df_county_fips_map.set_index('state_county', inplace=True)
#@title
# Concatenate the state and county names because county names are not unique across states.
df.res_county = df.res_county.str.lower()
df['state_county'] = df.res_state + '-' + df.res_county
df['state_county'] = df.state_county.astype('string').str.strip()
df.set_index('state_county', inplace=True)
df['race_ethnicity_combined'] = df.race_ethnicity_combined.astype('string').str.strip()
race_ethnicity_combined_map = {
'Asian, Non-Hispanic': 'asian_cases',
'Black, Non-Hispanic': 'black_cases',
'White, Non-Hispanic': 'white_cases',
'American Indian/Alaska Native, Non-Hispanic': 'aian_cases',
'Hispanic/Latino': 'hispanic_cases',
'Multiple/Other, Non-Hispanic': 'other_cases',
'Native Hawaiian/Other Pacific Islander, Non-Hispanic': 'nhpi_cases',
'Missing': 'unknown_cases',
'Unknown': 'unknown_cases',
'NA': 'na_cases',
}
df = df.replace(to_replace={'race_ethnicity_combined': race_ethnicity_combined_map})
#@title
merged_df = df.join(df_county_fips_map, on="state_county", how='inner', lsuffix='_left', rsuffix='_right')
# Create a crosstab table with rows = counties, columns = race_ethnicity_combined.
crosstab_df = pd.crosstab(merged_df['county_fips'], merged_df.race_ethnicity_combined, values=merged_df.cases, aggfunc=sum,
margins=True,
margins_name='total_cases'
)
# Have to reset_index() to go from pandas multi-index to single index.
crosstab_df = crosstab_df.reset_index()
crosstab_df.drop(axis=0, index=len(crosstab_df) - 1, inplace=True)
crosstab_df['county_fips'] = crosstab_df.county_fips.astype(int)
crosstab_df['total_known_cases'] = crosstab_df['total_cases'] - crosstab_df.unknown_cases.fillna(0)
crosstab_df['total_known_cases'] = crosstab_df['total_cases'] - crosstab_df.na_cases.fillna(0) - crosstab_df.unknown_cases.fillna(0)
#@title
df_acs_name_lookup = pd.io.gbq.read_gbq(f'''
SELECT
*
FROM
`msm-internal-data.ipums_acs.acs_2019_5year_county`
''', project_id=project_id)
df_acs_name_lookup['state_county'] = df_acs_name_lookup.county.astype('string').str.strip() + ', ' + df_acs_name_lookup.state.astype('string').str.strip()
df_acs_name_lookup.drop(columns=['state', 'county'], inplace=True)
df_acs_name_lookup.set_index('county_fips', inplace=True)
county_chart_df = crosstab_df.join(df_acs_name_lookup, on="county_fips", how='inner', lsuffix='_left', rsuffix='_right')
county_chart_df.county_fips = county_chart_df.county_fips.astype(int)
#@title
#print(len(county_chart_df))
#print(county_chart_df.total_pop.sum())
#print(county_chart_df.total_pop.sum() / 324697795) # Population covered in these counties
#print(0.55 * 324697795) # NYT population
#@title
nyt_counties_df = pd.io.gbq.read_gbq(nyt_counties_query, project_id=project_id)
nyt_counties_df.rename(columns={'county_fips_code': 'county_fips'}, inplace=True)
nyt_counties_df.county_fips.unique()
nyt_counties_df['county_fips'] = nyt_counties_df.county_fips.astype(int)
nyt_counties_df.set_index('county_fips', inplace=True)
county_chart_df.set_index('county_fips', inplace=True)
nyt_merged_df = county_chart_df.join(nyt_counties_df, on="county_fips", how='inner', lsuffix='_left', rsuffix='_right')
nyt_merged_df = nyt_merged_df.reset_index()
nyt_merged_df['percent'] = round(nyt_merged_df.total_cases / nyt_merged_df.nyt_cases, 2)
#PrintSummaryStats(nyt_merged_df)
We can do the same analysis at the county level using the CDC vs. NYT data.
Each dot is a county (hover to see details). We show all 3,045 available counties on the left and zoom in on the smaller counties on the right.
#@title
cdc_nyt_fields_dict = {
'x': {'name': 'nyt_cases', 'format': ',', 'title': 'NYT cases'},
'y': {'name': 'total_cases', 'format': ',', 'title': 'CDC cases'},
'percent': {'name': 'percent', 'format': '.2f', 'title': 'Ratio of CDC to NYT'},
}
cdc_nyt_title = 'Ratio of CDC Cases to NYT Cases by County as of Dec 16'
zoom_cdc_nyt_title = 'Zoom in on counties with up to 100,000 population'
scatter = CreateScatterPlot(
nyt_merged_df, cdc_nyt_fields_dict, cdc_nyt_title, 700000, scatter_height, scatter_width, 'county', 'ratio'
)
zoom_scatter = CreateScatterPlot(
nyt_merged_df, cdc_nyt_fields_dict, zoom_cdc_nyt_title, 100000, scatter_height, scatter_width, 'county', 'ratio'
)
(scatter | zoom_scatter).configure_view(
strokeWidth=0,
).configure_legend(
gradientLength=map_height - 50
).configure_mark(
stroke='grey'
).display()
The ratio of CDC to NYT cases is between 0.00 and 9.80 for the 3,045 counties in the CDC data:
We can also view these ratios on the map on the right and compare them to the state-level totals map from the previous section on the left.
#@title
cdc_nyt_fields_dict = {
'x': {'name': 'nyt_cases', 'format': ',', 'title': 'NYT cases'},
'y': {'name': 'total_cases', 'format': ',', 'title': 'CDC cases'},
'percent': {'name': 'percent', 'format': '.2f', 'title': 'Ratio of CDC to NYT'},
}
cdc_nyt_title = 'Ratio of CDC Cases to NYT Cases by County as of Dec 16'
cdc_nyt_map = CreateMap(
nyt_merged_df, cdc_nyt_fields_dict, cdc_nyt_title, total_cases_scale_max, map_height, map_width, 'county', 'ratio'
)
cdc_crdt_map = CreateMap(
cdc_crdt_merged_df, cdc_crdt_fields_dict, cdc_crdt_title, total_cases_scale_max, map_height, map_width, 'state', 'ratio'
)
(cdc_crdt_map | cdc_nyt_map).configure_view(
strokeWidth=0,
).configure_legend(
gradientLength=map_height - 50
).display()
Notes:
We can see that the ratio of the CDC case data to CRDT/NYT aggregate data is highly variable across the U.S., but there is less variability across the counties within each state. This pattern indicates that the data completeness issues may be due to state-level policies or data collection processes rather than at the county level. We can also see that some counties are missing entirely from the data; e.g., in Texas, Wyoming, West Virginia, and Nebraska. It's possible some of these counties have cases in the data but the county name was suppressed for privacy reasons due to small population sizes. Even so, those cases would still have a state name, so they would be captured in the map on the left above.
#@title
compare_cases_unknowns_query = ('''
SELECT
res_state,
race_ethnicity_combined,
COUNT(*) as cdc_cases
FROM
%s
GROUP BY
res_state,
race_ethnicity_combined
''' % cdc_table)
states_df = pd.io.gbq.read_gbq(compare_cases_unknowns_query, project_id=project_id)
for state in ('Unknown', 'NA', 'OCONUS'):
states_df = states_df[states_df.res_state != state]
states_df['race_ethnicity_combined'] = states_df.race_ethnicity_combined.astype('string').str.strip()
states_df = states_df.replace(to_replace={'race_ethnicity_combined': {
'Asian, Non-Hispanic': 'cdc_known_cases',
'Black, Non-Hispanic': 'cdc_known_cases',
'White, Non-Hispanic': 'cdc_known_cases',
'American Indian/Alaska Native, Non-Hispanic': 'cdc_known_cases',
'Hispanic/Latino': 'cdc_known_cases',
'Multiple/Other, Non-Hispanic': 'cdc_known_cases',
'Native Hawaiian/Other Pacific Islander, Non-Hispanic': 'cdc_known_cases',
'Missing': 'cdc_unknown_cases',
'Unknown': 'cdc_unknown_cases',
'NA': 'cdc_na_cases',
}})
states_df.rename(columns={'res_state': 'state'}, inplace=True)
#@title
crosstab_df = pd.crosstab(states_df['state'], states_df.race_ethnicity_combined, values=states_df.cdc_cases, aggfunc=sum,
margins=True,
margins_name='cdc_cases'
)
# Have to reset_index() to go from pandas multi-index to single index.
crosstab_df = crosstab_df.reset_index()
crosstab_df.drop(axis=0, index=len(crosstab_df) - 1, inplace=True)
crosstab_df['cdc_known_or_na_cases'] = crosstab_df['cdc_cases'] - crosstab_df.cdc_unknown_cases.fillna(0)
crosstab_df['cdc_known_cases'] = crosstab_df['cdc_cases'] - crosstab_df.cdc_na_cases.fillna(0) - crosstab_df.cdc_unknown_cases.fillna(0)
crosstab_df
crdt_merged_df = crosstab_df.join(crdt_df, on="state", how='inner', lsuffix='_left', rsuffix='_right')
crdt_merged_df.reset_index(inplace=True)
crdt_merged_df['state_fips_code'] = crdt_merged_df.state
crdt_merged_df = crdt_merged_df.replace(to_replace={'state_fips_code': states_to_fips})
crdt_merged_df['cdc_known_cases_percent'] = round(crdt_merged_df.cdc_known_cases / crdt_merged_df.cdc_cases, 4)
crdt_merged_df['cdc_known_or_na_cases_percent'] = round(crdt_merged_df.cdc_known_or_na_cases / crdt_merged_df.cdc_cases, 4)
crdt_merged_df['percent'] = round(crdt_merged_df.cdc_cases / crdt_merged_df.crdt_cases, 4)
crdt_merged_df['percent_known_cases'] = round(crdt_merged_df.cdc_known_cases / crdt_merged_df.crdt_known_race_cases, 4)
crdt_merged_df_no_ny = crdt_merged_df[crdt_merged_df.state != 'NY']
#PrintSummaryStats(crdt_merged_df_no_ny)
When evaluating the percent of cases that report on race/ethnicity in the CDC data, we also need to consider the 2% of overall cases with race/ethnicity that were suppressed due to privacy reasons. We should give states and counties credit for reporting race/ethnicity data for those cases even if we aren't able to use it due to privacy suppression. Below, the maps on the top left shows the percent of cases with known race/ethnicity and the map on the top right shows the percent of cases with known or suppressed race/ethnicity. The maps on the bottom show the same information at the county level.
#@title
chart_df = county_chart_df.copy(deep=True)
chart_df.reset_index(inplace=True)
chart_df.county_fips = chart_df.county_fips.astype(int)
chart_df['percent_known_cases'] = round(chart_df.total_known_cases / chart_df.total_cases, 2)
chart_df['total_known_or_na_cases'] = chart_df.total_known_cases + chart_df.na_cases
chart_df['percent_known_or_na_cases'] = round(chart_df.total_known_or_na_cases / chart_df.total_cases, 2)
#@title
cdc_known_state_fields_dict = {
'x': {'name': 'cdc_known_cases', 'format': ',', 'title': 'Known race/ethnicity cases'},
'y': {'name': 'cdc_cases', 'format': ',', 'title': 'CDC cases'},
'percent': {'name': 'cdc_known_cases_percent', 'format': '.0%', 'title': 'Percent known cases'},
}
cdc_known_state_title = 'CDC Cases with Known Race/Ethnicity by State as of %s' % date_display_name
cdc_known_state_map = CreateMap(
crdt_merged_df, cdc_known_state_fields_dict, cdc_known_state_title, total_cases_scale_max, map_height, map_width, 'state', 'percent'
)
cdc_known_or_na_state_fields_dict = {
'x': {'name': 'cdc_known_or_na_cases', 'format': ',', 'title': 'Known or suppressed race/ethnicity cases'},
'y': {'name': 'cdc_cases', 'format': ',', 'title': 'CDC cases'},
'percent': {'name': 'cdc_known_or_na_cases_percent', 'format': '.0%', 'title': 'Percent known or suppressed cases'},
}
cdc_known_or_na_state_title = 'CDC Cases with Known or Suppressed Race/Ethnicity by State as of %s' % date_display_name
cdc_known_or_na_state_map = CreateMap(
crdt_merged_df, cdc_known_or_na_state_fields_dict, cdc_known_or_na_state_title, total_cases_scale_max, map_height, map_width, 'state', 'percent'
)
(cdc_known_state_map | cdc_known_or_na_state_map).configure_view(
strokeWidth=0,
).configure_legend(
gradientLength=map_height - 50
).display()
#@title
cdc_known_county_fields_dict = {
'x': {'name': 'total_known_cases', 'format': ',', 'title': 'Known race/ethnicity cases'},
'y': {'name': 'total_cases', 'format': ',', 'title': 'CDC cases'},
'percent': {'name': 'percent_known_cases', 'format': '.0%', 'title': 'Percent known cases'},
}
cdc_known_county_title = 'CDC Cases with Known Race/Ethnicity by County as of %s' % date_display_name
cdc_known_county_map = CreateMap(
chart_df, cdc_known_county_fields_dict, cdc_known_county_title, total_cases_scale_max, map_height, map_width, 'county', 'percent'
)
cdc_known_or_na_county_fields_dict = {
'x': {'name': 'total_known_or_na_cases', 'format': ',', 'title': 'Known or suppressed race/ethnicity cases'},
'y': {'name': 'total_cases', 'format': ',', 'title': 'CDC cases'},
'percent': {'name': 'percent_known_or_na_cases', 'format': '.0%', 'title': 'Percent known or suppressed cases'},
}
cdc_known_or_na_county_title = 'CDC Cases with Known or Suppressed Race/Ethnicity by County as of %s' % date_display_name
cdc_known_or_na_county_map = CreateMap(
chart_df, cdc_known_or_na_county_fields_dict, cdc_known_or_na_county_title, total_cases_scale_max, map_height, map_width, 'county', 'percent'
)
(cdc_known_county_map | cdc_known_or_na_county_map).configure_view(
strokeWidth=0,
).configure_legend(
gradientLength=map_height - 50
).display()
#@title
#PrintSummaryStats(crdt_merged_df, field='cdc_known_cases_percent')
#PrintSummaryStats(crdt_merged_df, field='cdc_known_or_na_cases_percent')
#tuple(crdt_merged_df[crdt_merged_df.cdc_known_or_na_cases_percent <= .5].state)
Note: A larger version of the county maps for hovering over smaller counties is available in the Appendix.
We can see that the maps on the right are bluer than those on the left, which means that states and counties are doing a better job at reporting race/ethnicity when we consider the data that was suppressed for privacy reasons. This effect is most pronounced in states like Wyoming, Texas, and Louisiana, which have many counties with small populations or small population subgroups.
We can see the increase in the percent of cases with known race/ethnicity --> known or suppressed suppressed across all states:
How does the CDC data compare to the CRDT data, which is the most up-to-date aggregate data we have for race/ethnicity at the state level?
We can compare the number of cases with known race/ethnicity in each state between the CDC and CRDT data. It's worth mentioning that the CRDT captures the many non-standard ways in which different states report on race/ethnicity, where ethnicity is whether a person is Hispanic/Latino. Some states report race/ethnicity as a combined field where each race/ethnicity group is mutually exclusive, which is how the CDC reports this field. Other states report race/ethnicity as separate fields where Hispanic/Latino people are counted within different race groups as well as in a separate field for ethnicity. States can also differ in terms of which race categories they use, how they define them, whether multiracial people are counted multiple times in different categories, and what's included in the "Other" race category.
In the comparison below, we look at the number of people in the CRDT with known race within each state. If a state uses a combined race/ethnicity field, then it's a straightforward comparison to the CDC's combined race/ethnicity field. If a state uses separate fields for race/ethnicity, then we still use the number of people with known race within each state because all of the race categories will also contain Hispanic/Latino people. We could potentially be undercounting the number of people with known race/ethnicity in the CRDT is if there are people who have unknown race but known ethnicity. If we adjusted the numbers in those cases, it would make the CDC numbers look even worse in comparison to the CRDT than they already do, so we can think of the comparison below as a best-case scenario for the CDC data.
#@title
fields_dict = {
'x': {'name': 'crdt_known_race_cases', 'format': ',', 'title': 'CRDT known race/ethnicity cases'},
'y': {'name': 'cdc_known_cases', 'format': ',', 'title': 'CDC known race/ethnicity cases'},
'percent': {'name': 'percent_known_cases', 'format': '.2f', 'title': 'Ratio of CDC to CRDT'},
}
title = 'Ratio of CDC to CRDT Cases with Known Race/Ethnicity by State as of %s' % date_display_name
CreateScatterPlotAndMap(
crdt_merged_df, fields_dict, title, 1200000, scatter_height, scatter_width, map_width, 'state', 'ratio'
).display()
Notes:
The ratio of CDC to CRDT cases with known race/ethnicity is between 0.01 and 1.18 for all states excluding New York:
Only 4 states (Massachusetts, Minnesota, Utah, Washington) had more cases with known race/ethnicity in the CDC data than in the CRDT data, whereas 23 states had more total cases in the CDC data than in the CRDT data in the section above.
We can also look at the CRDT's percentage of cases with known race/ethnicity on its own, similar to charts for the CDC data in the previous section. Overall, 66% of the cases in the CRDT data have known race/ethnicity compared to 55% in the CDC data (57% with suppressed data).
#@title
crdt_known_state_fields_dict = {
'x': {'name': 'crdt_known_race_cases', 'format': ',', 'title': 'Known race/ethnicity cases'},
'y': {'name': 'crdt_cases', 'format': ',', 'title': 'CDC cases'},
'percent': {'name': 'crdt_known_race_cases_percent', 'format': '.0%', 'title': 'Percent known cases'},
}
crdt_known_state_title = 'CRDT Cases with Known Race/Ethnicity by State as of %s' % date_display_name
crdt_known_map = CreateMap(
cdc_crdt_merged_df, crdt_known_state_fields_dict, crdt_known_state_title, total_cases_scale_max, map_height, map_width, 'state', 'percent'
)
crdt_known_map.configure_view(
strokeWidth=0,
).configure_legend(
gradientLength=map_height - 50
).display()
#@title
#print(crdt_merged_df.crdt_known_race_cases.sum() / crdt_merged_df.crdt_cases.sum())
#PrintSummaryStats(cdc_crdt_merged_df, field='crdt_known_race_cases_percent')
The percent of CRDT cases with known race/ethnicity is between 0% and 99% for all states:
CRDT is a more complete source for race/ethnicity data at the state level than the CDC data in terms of both the counts of cases with race/ethnicity data and the percentage of cases with race/ethnicity data. The only exceptions to this are New York, which has no cases with race/ethnicity, and possibly Massachusetts, which has 1.18 times as many cases with race/ethnicity than the CRDT.
How can states and counties improve their data completeness for race/ethnicity data, especially when compared to the more reliable and up-to-date aggregate data that comes from public health websites, as collected by the CRDT and NYT?
There are two ways in which states can improve the data they send to the CDC:
In the Total Case Counts section above, we identified the states and counties with the biggest discrepancies relative to aggregate data. In the Cases with Race/Ethnicity section, we looked at the percentage of cases within each state and county that have race/ethnicity data.
The charts below show those two components together; the scatterplots show (1) the discrepancy vs. CRDT/NYT total case counts on the y-axis, and (2) the percentage of cases with known or suppressed race/ethnicity on the x-axis. The colors of the dots and on the map show the product of those two numbers, which is the percentage of CRDT/NYT total case counts accounted for in the CDC data with race/ethnicity. This is a composite measure of what percentage of total cases are included in the CDC data with known or suppressed race/ethnicity.
#@title
nyt_cdc_known_merged_df = chart_df.join(nyt_counties_df, on="county_fips", how='inner', lsuffix='_left', rsuffix='_right')
nyt_cdc_known_merged_df.reset_index(inplace=True)
nyt_cdc_known_merged_df['percent'] = round(nyt_cdc_known_merged_df.total_cases / nyt_cdc_known_merged_df.nyt_cases, 2)
#@title
crdt_merged_df['percent_max_100'] = crdt_merged_df.percent.clip(upper=1)
crdt_merged_df['percent_reccs'] = crdt_merged_df.percent_max_100 * crdt_merged_df.cdc_known_or_na_cases_percent
state_reccs_fields_dict = {
'y': {'name': 'percent_max_100', 'format': '.0%', 'title': 'CDC percent of CRDT cases capped at 100%'},
'x': {'name': 'cdc_known_or_na_cases_percent', 'format': '.0%', 'title': 'CDC percent with known or suppressed race/ethnicity'},
'percent': {'name': 'percent_reccs', 'format': '.0%', 'title': 'Product: CDC percent of CRDT total with race/ethnicity'},
}
state_reccs_title = 'State Completeness: Percent of Total Cases x Percent with Race/Ethnicity'
scatter = CreateScatterPlotAndMap(
crdt_merged_df, state_reccs_fields_dict, state_reccs_title, 1, scatter_height, scatter_width, map_width, 'state', 'percent'
)
scatter.configure_view(
strokeWidth=0,
).configure_legend(
gradientLength=map_height - 50
).configure_mark(
stroke='grey'
).display()
#@title
nyt_cdc_known_merged_df['percent_max_100'] = nyt_cdc_known_merged_df.percent.clip(upper=1)
nyt_cdc_known_merged_df['percent_reccs'] = nyt_cdc_known_merged_df.percent_max_100 * nyt_cdc_known_merged_df.percent_known_or_na_cases
county_reccs_fields_dict = {
'y': {'name': 'percent_max_100', 'format': '.0%', 'title': 'CDC percent of NYT cases capped at 100%'},
'x': {'name': 'percent_known_or_na_cases', 'format': '.0%', 'title': 'CDC percent with known or suppressed race/ethnicity'},
'percent': {'name': 'percent_reccs', 'format': '.0%', 'title': 'Product: CDC percent of NYT total with race/ethnicity'},
}
county_reccs_title = state_reccs_title = 'County Completeness: Percent of Total Cases x Percent with Race/Ethnicity'
scatter = CreateScatterPlotAndMap(
nyt_cdc_known_merged_df, county_reccs_fields_dict, county_reccs_title, 1, scatter_height, scatter_width, map_width, 'county', 'percent'
)
scatter.configure_view(
strokeWidth=0,
).configure_legend(
gradientLength=map_height - 50
).configure_mark(
stroke='grey'
).display()
Notes:
The scatterplots above can help us diagnose the issues in each state or county:
For states and counties to improve the percentage of cases with known race/ethnicity, most need to reduce their number of Unknowns or missing data. A few states, however, seem to have errors where some race/ethnicity groups are almost entirely missing from the data:
Another way to evaluate states with issues is to see if they are improving over time with the percentage of cases with known race/ethnicity. We can look at the 11 states in the left side of the states scatterplot above that have < 50% of cases with known or suppressed race/ethnicity.
#@title
compare_cases_unknowns_query = ('''
SELECT
res_state,
race_ethnicity_combined,
COUNT(*) as cdc_cases
FROM
%s
WHERE res_state = 'RI'
GROUP BY
res_state,
race_ethnicity_combined
''' % cdc_table)
state_df = pd.io.gbq.read_gbq(compare_cases_unknowns_query, project_id=project_id)
state_df['percent'] = round(state_df.cdc_cases / state_df.cdc_cases.sum(), 2)
#@title
cdc_states_by_month_query = ('''
SELECT
res_state,
CONCAT(EXTRACT(YEAR from cdc_case_earliest_dt), '-Q', EXTRACT(QUARTER from cdc_case_earliest_dt)) as date,
COUNT(*) as total_cases,
FROM
%s
WHERE
cdc_case_earliest_dt >= DATE(2020, 1, 1) AND
cdc_case_earliest_dt < DATE(2020, 12, 1) AND
res_state in ('AK', 'CA', 'CT', 'DE', 'GA', 'LA', 'MD', 'ND', 'NY', 'PA', 'RI')
GROUP BY
1, 2
ORDER BY
1, 2
''' % cdc_table)
cdc_states_by_month_known_or_na_query = ('''
SELECT
res_state,
CONCAT(EXTRACT(YEAR from cdc_case_earliest_dt), '-Q', EXTRACT(QUARTER from cdc_case_earliest_dt)) as date,
COUNT(*) as known_or_na_cases,
FROM
%s
WHERE
cdc_case_earliest_dt >= DATE(2020, 1, 1) AND
cdc_case_earliest_dt < DATE(2020, 12, 1) AND
race_ethnicity_combined != 'Unknown' AND
race_ethnicity_combined != 'Missing'
GROUP BY
1, 2
ORDER BY
1, 2
''' % cdc_table)
cdc_states_by_month_df = pd.io.gbq.read_gbq(cdc_states_by_month_query, project_id=project_id)
cdc_states_by_month_df.set_index(keys=['res_state', 'date'], inplace=True)
cdc_states_by_month_known_or_na_df = pd.io.gbq.read_gbq(cdc_states_by_month_known_or_na_query, project_id=project_id)
cdc_states_by_month_known_or_na_df.set_index(keys=['res_state', 'date'], inplace=True)
cdc_known_over_time = cdc_states_by_month_df.join(cdc_states_by_month_known_or_na_df, how='left')
cdc_known_over_time['percent_known_or_na'] = round(cdc_known_over_time.known_or_na_cases / cdc_known_over_time.total_cases, 2)
cdc_known_over_time.reset_index(inplace=True)
#@title
base = alt.Chart(cdc_known_over_time).mark_line(point=True).encode(
x=alt.X('date', title='CDC earliest report date', axis=alt.Axis(labelAngle=0)),
y=alt.Y('percent_known_or_na', title='Percent unknown or suppressed race/ethnicity', axis=alt.Axis(format='%')),
color=alt.Color('res_state', scale=alt.Scale(scheme='category20'), title='State')
).properties(
title='States with less than 50% of Cumulative Cases with Known or Suppressed Race/Ethnicity',
height=map_height,
width=map_width
).display()
We can see that a few of these states have improved over time and now have more than 50% of cases with known or suppressed race/ethnicity: Louisiana, New York, and Georgia. A few states started off the year with greater than 50%: Alaska, Connecticut, and Maryland. But none of these states have fully fixed their issues with the possible exception of Louisiana if it continues on its current trajectory.
The additional fields in the data, including whether the person died or was hospitalized, are all known for fewer than 50% of cases.
#@title
field_list = ['death_yn', 'hosp_yn', 'icu_yn', 'onset_dt', 'pos_spec_dt', 'hc_work_yn',
'pna_yn', 'abxchest_yn', 'acuterespdistress_yn', 'mechvent_yn', 'fever_yn', 'sfever_yn', 'chills_yn', 'myalgia_yn', 'runnose_yn',
'sthroat_yn', 'cough_yn', 'sob_yn', 'nauseavomit_yn', 'headache_yn', 'abdom_yn', 'diarrhea_yn', 'medcond_yn']
project_id = 'msm-secure-data-1b'
table = '`msm-secure-data-1b.ndunlap_secure.cdc_restricted_access_20201231`'
FieldAnalysis(project_id, table, field_list).display()
The CDC also commented on these fields in their case data FAQs:
Because of the volume of cases, most health departments are unable to conduct investigations of every case to obtain additional information. Because of this, most case reports are missing data on patient demographics, symptoms, underlying health conditions, characteristics of hospitalizations such as ventilator use, and other factors such as recent travel history.
The case report form contains many more fields, but unfortunately, the data gets less complete as you go down the form. Citizens for Responsibility and Ethics in Washington (CREW) obtained a version of this data via FOIA that contained 101 fields with data up to Aug 25, 2020 and shared it with MSM/SHLI. Several of the additional fields from that dataset are shown below; the field with the most known data is whether the case was associated with an outbreak, but even that is only known for 30% of cases.
#@title
field_list = ['death_week', 'icu_length', 'hosp_length', 'translator_yn', 'housing', 'exp_work_critical', 'outbreak_associated',
'rigors_yn', 'taste_yn', 'fatigue_yn', 'wheezing_yn', 'diffbreathing_yn', 'chestpain_yn', 'test_pcr', 'test_serologic',
'exp_adultfacility', 'exp_airport', 'exp_animal', 'exp_community', 'exp_gathering', 'exp_contact', 'exp_correctional',
'exp_ship', 'exp_house', 'exp_other', 'exp_school', 'exp_othcountry', 'exp_unk', 'exp_work']
project_id = 'msm-internal-data'
table = '`msm-internal-data.crew.covid_case_surveillance`'
FieldAnalysis(project_id, table, field_list).display()
To make it easier to hover over small counties, here are larger versions of the county maps that appeared in this report.
#@title
cdc_nyt_map = CreateMap(
nyt_merged_df, cdc_nyt_fields_dict, cdc_nyt_title, total_cases_scale_max, map_height * 2, map_width * 2, 'county', 'ratio'
).configure_view(
strokeWidth=0,
).configure_legend(
gradientLength=map_height - 50
)
cdc_nyt_map.display()
#@title
cdc_known_county_map = CreateMap(
chart_df, cdc_known_county_fields_dict, cdc_known_county_title, total_cases_scale_max, map_height * 2, map_width * 2, 'county', 'percent'
).configure_view(
strokeWidth=0,
).configure_legend(
gradientLength=map_height - 50
)
cdc_known_county_map.display()
#@title
cdc_known_or_na_county_map = CreateMap(
chart_df, cdc_known_or_na_county_fields_dict, cdc_known_or_na_county_title, total_cases_scale_max, map_height * 2, map_width * 2, 'county', 'percent'
).configure_view(
strokeWidth=0,
).configure_legend(
gradientLength=map_height - 50
)
cdc_known_or_na_county_map.display()
#@title
county_completeness = CreateMap(
nyt_cdc_known_merged_df, county_reccs_fields_dict, county_reccs_title, 1, map_height * 2, map_width * 2, 'county', 'percent'
)
county_completeness.configure_view(
strokeWidth=0,
).configure_legend(
gradientLength=map_height - 50
).configure_mark(
stroke='grey'
).display()